ABSTRACT

Making judicious channel access and transmission scheduling decisions is essential for improving performance (delay, throughput, etc.) as well as energy and spectral efficiency in multichannel wireless systems. This problem has been a subject of extensive study in the past decade, and the resulting dynamic and opportunistic channel access schemes can bring potentially significant improvement over traditional schemes. However, a common and severe limitation of these dynamic schemes is that they almost always require some form of a priori knowledge of the channel statistics. A natural remedy is a learning framework, which has also been extensively studied in the same context, but a typical learning algorithm in this literature seeks only the best static policy (i.e., to stay in the best channel), with performance measured by weak regret, rather than learning a good dynamic channel access policy. There is thus a clear disconnect between what an optimal channel access policy can achieve with known channel statistics that actively exploits temporal, spatial and spectral diversity, and what a typical existing learning algorithm aims for, which is the static use of a single channel devoid of diversity gain. In this paper we bridge this gap by designing learning algorithms that track known optimal or sub-optimal dynamic channel access and transmission scheduling policies, thereby yielding performance measured by a form of strong regret, the accumulated difference between the reward returned by an optimal solution when a priori information is available and that by our online algorithm. We do so in the context of two specific algorithms that appeared in [1] and [3], respectively, the former for a multiuser single-channel setting and the latter for a single-user multichannel setting. In both cases we show that our algorithms achieve sub-linear regret uniform in time and outperform the standard weak-regret learning algorithms.


Categories and Subject Descriptors 

F.1.2 [Modes of Computation] : Online Computation; C.2.1 [Network 
Architecture and Design] : Wireless Communication; G.3 [Probability 
and Statistics]: Distribution Functions 
1. INTRODUCTION

Making judicious channel access and transmission scheduling decisions is essential for improving performance (delay, throughput, etc.) as well as energy and spectral efficiency in wireless systems, especially those consisting of multiple users and multiple channels. Such decisions are often non-trivial because of the time-varying nature of the wireless channel condition, which further varies across different users and different spectrum bands. Such temporal, spatial and spectral diversity provides opportunities for a radio transceiver to exploit for performance gain, and the past decade has seen many research advances in this area. For instance, a transmitter can seek the best channel through channel sensing before transmission; see, e.g., [?] for dynamic multi-channel MAC schemes that allow transmitters to opportunistically switch between channels in search of good instantaneous channel condition. If a transmitter consistently selects a channel with better instantaneous condition (e.g., higher instantaneous received SNR) from a set of channels, then over time it sees a (potentially much) higher average rate [?]. Similarly, a transmitter can postpone transmission if the sensed instantaneous condition is poor in hopes of better condition later; see, e.g., [?] for stopping rule based sequential channel sensing policies in the single-user multichannel and single-channel multiuser scenarios, respectively. Variations on the same theme include [9], where a distributed opportunistic scheduling problem under delay constraints is investigated, and [13], where a generalized stopping rule is developed for the single-user multichannel setting.

These dynamic channel access schemes (both optimal and sub-optimal) improve upon traditional schemes such as channel splitting [?], multi-channel CSMA [?], and multi-rate systems [?]. However, a common and severe limitation of these dynamic schemes is that they almost always require some form of a priori knowledge of the channel statistics. For instance, a typical assumption is that the channel conditions evolve as an IID process and that its distribution for each channel is known to the transmitter/user, see e.g., [?]. While in some limited settings such information may be acquired with accuracy and low latency, this assumption does not generally hold. Furthermore, the channel statistics may be time-varying, in which case such an assumption can only be justified if there exists a separate channel sampling process which keeps the assumed channel statistics information up to date.

To relax such an assumption, it is therefore natural to cast the dynamic channel sensing and transmission scheduling problem in a learning context, where the user is not required to possess a priori channel statistics but will try to learn as actions are taken and observations are made. Within this context, online learning or regret learning, also often referred to as the Multi-Armed Bandit (MAB) framework [?], is particularly attractive, as it allows a user to optimize its performance throughout its learning process. For this reason, this learning framework has also been extensively studied within the context of multichannel dynamic spectrum access, see e.g., [?] for single-user and [?] for multiuser settings. However, in most of this literature, the purpose of the learning algorithm is for a transmitter to find the best channel in terms of its average condition and then use this channel for transmission the majority of the time. It follows that the performance of such learning algorithms is measured by weak regret, the difference in performance between a learning algorithm and the best single-action policy, which in this context is to always use the channel with the best average condition. Accordingly, the key ingredient in these algorithms is to form accurate estimates of the average condition for each channel.

We therefore see a clear disconnect between what an optimal channel access policy can achieve with known channel statistics (e.g., by employing a stopping rule based algorithm) that actively exploits temporal, spatial and spectral diversity, and what a typical existing learning algorithm aims for, i.e., essentially the static use of a single channel, which unfortunately completely eliminates the utilization of diversity gain.¹

Our goal is to bridge this gap and seek to design learning algorithms that, instead of trying to track the best average-condition channel, attempt to track a known optimal or sub-optimal channel access and transmission scheduling algorithm, thereby yielding performance measured by a form of strong regret. Our presentation and analysis strongly suggest that such learning algorithms may be constructed in a much broader context, i.e., they can be made to track any prescribed policy and not just those cited earlier or even limited to the dynamic spectrum access context. However, to make our discussion concrete, we shall present our results in the context of specific channel sensing and access algorithms.

Specifically, we present the general framework of such a learning algorithm, followed by the detailed instances designed to track the stopping rule policies given in [1] and [3], respectively. The choice of these two algorithms is not an arbitrary one. Our intention is to use two representatives to capture a fairly wide array of similar algorithms of this kind. The stopping rule algorithm in [1] is a relatively simple one, designed for multiple users competing for access to a single channel; it exploits temporal and spatial (multiuser) diversity, the idea being for a user to defer transmission if it perceives poor channel quality, thereby giving the opportunity to another user with better conditions. The stopping rule algorithm in [3], on the other hand, is much more complex in construction; it is designed for a single user with access to multiple channels by exploiting spectral and temporal diversity, the idea being to find the channel with the best instantaneous condition. Both algorithms assume that channel qualities evolve in an IID fashion with known probability distributions, though different channels may have different statistics (2); and both are provably optimal (or near-optimal) under mild technical conditions. For other stopping-rule based policies see also [?]. We show that in both cases our algorithms achieve a sub-linear accumulative strong regret (against their respective reference algorithms from [1] and [3]), thus achieving zero-regret averaged over time.

In this paper, we do not consider interference from multiple users; that is, we consider cases with either a single user or non-strategic and collaborative users. Applying regret learning results to scheduling problems with interference is, however, another interesting direction. In such cases, adversarial models will be needed to capture the effects of interference when multiple transmitters are present in the system. In particular, in [?] Asgeirsson et al. studied a capacity maximization problem in distributed wireless networks under the SINR interference model and showed that a constant factor approximation bound compared to the global optimum is achievable. In [?] Dams et al. proposed scheduling algorithms for a similar problem but under Rayleigh-fading interference models and showed a logarithmic order approximation. Then in a later work [?], the same authors extended their results to the case where there exists an adversarial jammer.

¹ Some multiuser learning algorithms attempt to separate users into different channels, and so do exploit to some degree the multiuser diversity gain, see e.g., [?].

The rest of the paper is organized as follows. Problem formulation is presented in Section 2 and the two reference optimal offline algorithms in Section 3. We present our online learning algorithms in Section 4 with performance analysis given in Section 5. Numerical results are given in Section 6 and we discuss several possible extensions of our work in Section 7. Section 8 concludes the paper.

2. PROBLEM FORMULATION 

In this section we present two system models and their corresponding transmission scheduling problems. This lays the foundation for us to introduce the two offline optimal stopping-rule policies from [1] and [3], respectively, in Section 3; these are the policies our learning algorithms presented in Section 4 aim to track.

2.1 Model I: multiuser, single-channel 

Under the first model (studied in [1]), there is a finite number of users/transmitters, indexed by the set $\mathcal{M} = \{1, 2, \ldots, M\}$, $M > 1$, and a single channel. The system works in discrete time slots indexed by $n = 1, 2, \ldots$. Denote the channel quality by $X(n)$, $n = 1, 2, \ldots$. This quantity measures how good a channel is; for example, $X(n)$ could model the Signal-to-Noise Ratio (SNR) for the channel at time $n$. At time $n$, if no one is transmitting on the channel, a user $i \in \mathcal{M}$ attempts to access with probability $0 < p_i < 1$ by sending a carrier sensing packet. A carrier sensing period takes a constant amount of time (in slots). The contention resolution is done by random access, i.e., an access attempt is successful when there is only one user attempting access, which occurs with probability $p_s = \sum_{i \in \mathcal{M}} p_i \prod_{j \neq i} (1 - p_j)$. Denote the random contention time between two successful accesses by $T_1$; it follows that $E[T_1]$ equals the carrier sensing period divided by $p_s$ (slots). We assume the process $\{X(n_k)\}_{k = 1, 2, \ldots}$ forms an IID process, where $n_k$ is the time the $k$-th contention succeeds. That is, we assume the samples collected at successful accesses are generated in an IID fashion (as assumed and argued in [1]). Upon a collision, the current slot will be abandoned and users re-compete in the next time slot. On the other hand, users keep silent if there is an active transmission on the channel. For simplicity it is assumed that $X(n)$ stays unchanged during each transmission, which may be justified if transmission times are kept on a smaller time scale than channel coherence times [?]. Once a user gains the access right (and sees the channel quality $X(n)$), it has two options:

• access the channel right away for K time slots (stop); or 

• give up the access opportunity, release the channel for all
users to re-compete (continue).

This can be more formally stated as an optimal stopping rule (OSR) problem: users decide at which time to stop the decision process and use the channel. There are a number of variations of this problem with slightly different models, see e.g., [?]. The idea is that when the channel quality is poor, a user would give up the transmission opportunity so that it is more likely that a user with better perceived channel quality will get to use it. Denote the stopping time by $\tau$; then the objective is to design a stopping rule for all users so as to maximize the rate-of-return, which is the effective data rate for each successful access [?]
where If is the strategy space and T|j. is the k-th contention time 
and Kt: = is the total amount of time spent for each 

successful transmission. 
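The contention model described for Model I can be checked numerically. The sketch below is illustrative only: it assumes users attempt access independently, takes the success probability to be the probability that exactly one user attempts, and uses hypothetical function names that do not come from the paper.

```python
from math import prod


def success_probability(p):
    """Probability that exactly one user attempts access in a slot.

    p[i] is the access probability of user i; independence across
    users is an assumption of this sketch.
    """
    return sum(
        p_i * prod(1.0 - p_j for j, p_j in enumerate(p) if j != i)
        for i, p_i in enumerate(p)
    )


def mean_contention_attempts(p):
    """Mean number of carrier-sensing attempts until one succeeds.

    The number of attempts is geometric with success probability p_s,
    so its mean is 1 / p_s; multiplying by the (constant) sensing
    period gives the mean contention time in slots.
    """
    return 1.0 / success_probability(p)
```

For two users who each attempt with probability 0.5, exactly one attempts with probability 0.5, so on average two sensing periods elapse between successful accesses.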

In this model, we regard the decision process between two consecutive successful transmissions (note that no successful transmission occurs if a user who wins access forgoes the transmission opportunity) as one meta stage. Suppose there are altogether $H$ meta stages (thus $H$ successful transmissions). We define the following strong regret performance measure.
In the above formulation, since channel conditions are IID over time, 
for each meta stage we restart the clock, i.e., we always set the first 
time slot for each meta stage as n = 1. a; is the stopping time for 
the /-th meta stage and r|;t) is the corresponding reward. 

Here we denote by the := the set of observations of 

channel qualities at meta stage / ( with for each user j). 


2.2 Model II: single-user, multichannel 

Under the second model (studied in [3]), there is a finite number of channels, denoted and indexed by $\Omega = \{1, 2, \ldots, N\}$, each of which yields a non-negative reward when selected for transmission (e.g., throughput, delay, etc.). For any subset $S \subseteq \Omega$ we will use $\Omega - S$ to denote the set $\{j : j \in \Omega, j \notin S\}$. There is one decision maker (user/transmitter) within the system. The system again works in discrete time slots $n = 1, 2, \ldots, \tau \leq N$; these however are much smaller time units than those under Model I because they are used only for channel sensing and not transmission. The user sequentially chooses a set of channels to probe for their condition, stops at a stopping time $\tau$ using a certain stopping rule, and selects a channel for transmission (over a period of time larger than a slot). The decision process thus consists of determining in which sequence to sense the channels, when to stop, and which channel to use for transmission when stopping.

For consistency we reuse the terminology meta stage to describe the above decision process between $n = 1$ and $\tau$; this will be referred to as one meta stage. Each time a new meta stage starts the clock is reset to $n = 1$. The meta stages are indexed by $t = 1, 2, \ldots, T$. There is a period of transmission between two successive meta stages. It is assumed that the channel condition remains constant within a single meta stage and forms an IID process over successive meta stages. This is modeled by a reward $X_i$ (generating $\{X_i(t)\}_t$) for the $i$-th channel, with pdf $f_{X_i}(\cdot)$ and cdf $F_{X_i}(\cdot)$, respectively. Channels are independent of each other, i.e., a specific channel $i$'s realization $X_i(t; \omega)$ does not reveal any information about the channels in $\Omega - \{i\}$.

The transmitter is able to sense one channel (to observe $X_i$) at each decision step $n$ with a finite and constant sensing cost $c_i > 0$ for each channel $i$. The system works in the following way: at each $n$ of meta stage $t$, the transmitter makes a decision between the following choices:

• continue sensing; in this case it furthermore decides
which channel to probe next (sense);

• stop sensing and proceed to transmit (access). In this
case there are two more options to choose from:

- access the channel with the best observed instantaneous
condition (access with recall);

- access the best channel (with highest expected reward)
from the un-probed set without sensing (access with
guess).

For the offline problem, due to the IID assumption on the channel condition, the decision strategy at each meta stage $t$ is the same. We thus suppress the time index $t$; the transmitter's objective is to choose the strategy that maximizes the collected reward minus the sum of probing costs:
where $\pi$ denotes a probing strategy and $\tau$ the stopping time. From [?], it can be shown that for time slots $n = 1, 2, \ldots, \tau$ at any meta stage $t$, a sufficient information state is given by the pair $(x(n), S_n)$, where $S_n$ is the un-probed channel set and $x(n)$ is the highest observed reward among the set of probed channels $\Omega - S_n$. Let $V(x, S)$ denote the value function, i.e., the maximum expected remaining reward given that the system state is $(x, S)$. The problem/decision process at the $n$-th decision step is then equivalent to the following dynamic programming (DP) formulation

$$V(x(n), S_n) = \max\Big\{ \max_{j \in S_n} \big\{ -c_j + E[V(\max\{x(n), X_j\}, S_n - \{j\})] \big\},\; x(n),\; \max_{j \in S_n} E[X_j] \Big\} , \qquad (3)$$

where the three terms on the RHS correspond to the decision op¬ 
tions sense, access with recall and access with guess, respectively. 
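To make the recursion concrete, the following is a minimal sketch of the DP for channels whose rewards take finitely many values. The discrete-distribution input format, the function name `make_value_function`, and the boundary condition V(x, empty set) = x are assumptions of this illustration rather than details taken from [3].

```python
from functools import lru_cache


def make_value_function(channels):
    """Build V(x, S) for discrete reward distributions.

    channels: dict mapping channel id -> (cost, [(value, prob), ...]).
    The input format is hypothetical and chosen only for illustration.
    """
    mean = {j: sum(v * p for v, p in dist) for j, (_, dist) in channels.items()}

    @lru_cache(maxsize=None)
    def V(x, S):
        # x: best reward observed so far; S: frozenset of un-probed channels.
        if not S:
            return x  # assumed boundary case: only access with recall remains
        guess = max(mean[j] for j in S)  # access with guess
        sense = max(                     # sense one more channel
            -channels[j][0]
            + sum(p * V(max(x, v), S - {j}) for v, p in channels[j][1])
            for j in S
        )
        return max(sense, x, guess)      # the three options in (3)

    return V
```

For example, with two channels that each yield 0 or 1 with equal probability at sensing cost 0.1, calling `V(0.0, frozenset({1, 2}))` compares guessing (value 0.5) against probing one channel and then deciding again.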

Our goal is to design an online algorithm $\alpha_t$, $t = 1, 2, \ldots, T$, based on the past observed history $\mathcal{F}_{t-1}, \ldots, \mathcal{F}_1$, so as to minimize the following strong regret measure.


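$$R_{II}(T) = \sup_{\pi} E^{\pi}\Big[ \sum_{t=1}^{T} \Big( X_{\pi_t(\tau)}(t) - \sum_{n=1}^{\tau - 1} c_{\pi_t(n)} \Big) \Big] - E^{\alpha}\Big[ \sum_{t=1}^{T} \Big( X_{\alpha_t(\tau)}(t) - \sum_{n=1}^{\tau - 1} c_{\alpha_t(n)} \Big) \Big] \qquad (4)$$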


where $\pi_t$ is the optimal decision at meta stage $t$ when the information on $\{X_i\}_{i \in \Omega}$ is known and $\tau$ the stopping time; $\pi_t(n)$, $n = 1, 2, \ldots, \tau$, are the channels selected at decision step $n$ of each meta stage $t$. $\alpha_t$ is the decision actually made at $t$ by the user based on past observations when the channel statistics are unknown.

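For both problems, if an algorithm achieves a regret that grows sub-linearly in the number of meta stages ($H$ for Model I, respectively $T$ for Model II), so that the regret averaged over meta stages tends to 0, then it is called sub-linear in total regret and zero-regret in time average (asymptotically optimal).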




















3. OFFLINE SOLUTIONS REVISITED 

To be self-contained as well as to provide some intuition for the design of the online algorithms, below we present the optimal offline solutions to the scheduling problems in Model I and Model II, respectively.

3.1 Algorithm description: Model I 

The solution for the scheduling problem in Model I is surprisingly clean and elegant, and can be easily described as follows. Within each meta stage $l$, the optimal stopping rule is given by a threshold policy [1]:

$$\tau^* = \min\{n \geq 1 : X(n) \geq x^*\} , \qquad (5)$$

where $x^*$ is given by the solution for $u$ in the following equation:

E[X{n) - u]+ = . (6) 

The corresponding algorithm is straightforward: at each $n$ when a user needs to make a decision, if $X(n) \geq x^*$, the user will transmit and otherwise will release the channel. Intuitively this says that when the channel quality is sufficiently good (as compared to $x^*$, which separates the decision regions for stop and continue), a user should transmit. This algorithm will be referred to as Offline_MU (Multiuser) in our subsequent discussion.
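The threshold rule is simple enough to state in a few lines of code. The sketch below assumes the threshold has already been computed and is passed in as `x_star`; the function name and the list-of-observations interface are hypothetical.

```python
def offline_mu_stopping_time(qualities, x_star):
    """Return the first 1-based step n at which X(n) >= x_star.

    qualities: channel qualities seen at successive successful
    contentions within one meta stage. Returns None if the threshold
    is never reached within the given observations.
    """
    for n, x in enumerate(qualities, start=1):
        if x >= x_star:
            return n  # stop: transmit at this step
    return None       # every observed opportunity was released


# Usage: the user releases the channel twice and transmits at step 3.
step = offline_mu_stopping_time([0.2, 0.5, 0.9, 0.3], x_star=0.8)
```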

3.2 Algorithm description: Model II 

The solution for Model II is much more involved; this is primarily because it allows access with guess as an option, which is very different from classical stopping time problems. In this sense this model presents a generalization. The optimal policy is shown in [3] to have three major steps: parameter calculation, sorting, and decision making, as detailed below.


STEP 1: Parameter calculation, $\forall j \in \Omega$

$$a_j = \min\{u : u \geq E[X_j],\; c_j \geq E[\max(X_j - u, 0)]\} ,$$
$$b_j = \max\{u : u \leq E[X_j],\; c_j \geq E[\max(u - X_j, 0)]\} .$$


STEP 2: Channel sorting 

1: Initialize $k := 1$, $S = \Omega$.

2: First compute the candidate set $\{j \in S : a_j = \max_{i \in S} a_i\}$, and then $j^*$ within this set:


j* ■ E[Xj] +4,>i, ■ 


E[Xj\Xj > aj] 


P{Xj>aj) 


3: Let $o_k = j^*$ (randomly select one if multiple $j^*$ exist) and set $k := k + 1$, $S = S - \{j^*\}$.

4: If $|S| \geq 1$, repeat 2; o.w. return the sorted set $\{o_1, \ldots, o_N\}$.

5: Relabel the sorted set as $\{1, 2, \ldots, N\}$.


STEP 1 is based on a threshold property for the optimal policy proved in [?]. Intuitively speaking, $a_j$ and $b_j$ separate the decision regions as follows. A state larger than $a_j$ means further probing is not profitable, whereas a state below $b_j$ suggests gain from continued sensing. For STEP 2 we refer to each of its sub-steps $m$ as STEP 2.m (we will re-use this numbering style in later discussions). The sorting process is straightforward: we start with the full set $\Omega$ and at each step we first calculate the candidate set, i.e., the set of channels with the highest $a_j$. Then within the candidate set we further order the channels based on the one-step reward of probing channel $j$ when $x(n) = a_j$ and $j$ is the only remaining channel. The ordering repeats until all channels are in order.
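The thresholds of STEP 1 can be computed numerically. The sketch below is one possible implementation for rewards with finitely many values, using bisection on the two monotone functions E[max(X - u, 0)] and E[max(u - X, 0)]; the function name, the input format and the choice of bisection are assumptions of this illustration, not the procedure of [3].

```python
def probing_thresholds(dist, cost, iters=60):
    """Return (a_j, b_j) for one channel.

    dist: list of (value, prob) pairs (hypothetical input format);
    cost: sensing cost c_j > 0.
    """
    mean = sum(v * p for v, p in dist)
    lo_val = min(v for v, _ in dist)
    hi_val = max(v for v, _ in dist)

    def upside(u):    # E[max(X - u, 0)], non-increasing in u
        return sum(p * max(v - u, 0.0) for v, p in dist)

    def downside(u):  # E[max(u - X, 0)], non-decreasing in u
        return sum(p * max(u - v, 0.0) for v, p in dist)

    # a_j: smallest u >= E[X] with upside(u) <= cost.
    lo, hi = mean, hi_val
    if upside(lo) <= cost:
        a = lo
    else:
        for _ in range(iters):  # invariant: upside(hi) <= cost < upside(lo)
            mid = 0.5 * (lo + hi)
            if upside(mid) <= cost:
                hi = mid
            else:
                lo = mid
        a = hi

    # b_j: largest u <= E[X] with downside(u) <= cost.
    lo, hi = lo_val, mean
    if downside(hi) <= cost:
        b = hi
    else:
        for _ in range(iters):  # invariant: downside(lo) <= cost < downside(hi)
            mid = 0.5 * (lo + hi)
            if downside(mid) <= cost:
                lo = mid
            else:
                hi = mid
        b = lo
    return a, b
```

For a channel that yields 0 or 1 with equal probability, a small cost such as 0.1 gives a_j above the mean and b_j below it, while a cost of 0.3 collapses both thresholds onto the mean 0.5, the case in which probing that channel is never worthwhile.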

Given that the current information state is $(x(n), S_n)$ at decision epoch $n$, and denoting by $d_s$ the solution to the following equation (a solution is guaranteed to exist [?]):

$$V(0, S_n) = -c_1 + E[V(\max\{d_s, X_1\}, S_n - \{1\})] ,$$


STEP 3: Decision Making

1: If $x(n) \geq \max_{i \in S_n} a_i$, stop and access the best sensed channel.

2: Otherwise if $x(n) \geq d_s$, probe the first channel in $S_n$.

3: If $x(n) < d_s$ consider the following sub-cases

(1): If $b_1 \geq a_2$, then access/guess the 1st channel (in $S_n$, w/o sensing).

(2): If $b_2 \geq b_1$ or $g_1(0) \geq \max\{E[X_1], g_2(0)\}$, probe channel 1 in $S_n$.

(3): There exists a unique $b_0$, where $b_1 > b_0 > b_2$ and $g_1(b_0) = \max\{E[X_1], g_2(0)\}$. If $x(n) \geq b_0$: probe the 1st channel. If $x(n) < b_0$: guess channel 1 if $E[X_1] \geq g_2(0)$; probe channel 2 o.w.


where $g_i(x) = -c_i + E[V(\max(X_i, x), \{3 - i\})]$, $i = 1, 2$; $g_1(x)$ is the expected reward of probing channel 1 when facing information state $(x, \{1, 2\})$, while $g_2(x)$ is the reward for probing channel 2.

We denote the algorithm consisting of (STEP 1, STEP 2, STEP 
3) as Offline_MC (Multichannel) and it serves as the offline bench¬ 
mark solution for the multichannel scheduling problem. 

4. DESIGN OF ONLINE ALGORITHMS 

We detail our online learning algorithm in this section. To generalize the discussion we shall refer to the users in Model I and the channels in Model II as units. Then for a unifying framework of the online learning process there are two main phases, exploration and exploitation, which can be described as follows. (1) Exploration: sample the units that have been sampled fewer than $D_1(t) = L \cdot t^z \cdot \log t$ times up to meta stage $t$, with $L > 0$, $0 < z < 1$ being constant parameters. Here $L$ is a sufficiently large exploration parameter (we shall specify its bounds later alongside the analysis). When the unit represents a channel, the sampling process is to probe the channel quality; when a unit represents a user, the process corresponds to letting the user gain access to the channel to gather samples. (2) Exploitation: execute the optimal scheduling policy using collected statistics as detailed in the offline solution, but with built-in tolerance for estimation errors as detailed below. The sensing results (possibly multiple) from exploitation phases will also be collected and utilized for training purposes.

The above steps are rather standard within the regret learning literature: when a unit has not been explored/sensed sufficiently (e.g., a user has not accessed a channel a sufficient number of times in Model I or a channel has not been sampled sufficiently in Model II), the algorithm enters the exploration phase. Otherwise the algorithm mimics the procedures of calculating the optimal strategies as detailed in the offline solutions but with empirically estimated channel statistics. One notable difference here is that since the offline dynamic policies involve channel sensing as part of the decision process, effectively additional samples are collected during exploitation phases and used toward estimation. The general framework of this online approach is summarized as follows.

Online Solution: A unifying framework

1: Initialization: Initialize $L$, $z$, $t = 1$ and sample each unit at least once. Update the collection of samples as $\mathcal{F}_0$ and the number of samples for each unit $j$ as $n_j(t)$.

2: Exploration: At stage $t$, if $E(t) := \{j : n_j(t) < D_1(t)\} \neq \emptyset$, sense the set $E(t)$ of units.

3: Exploitation: If $E(t) = \emptyset$, calculate the optimal strategy according to the steps in the corresponding offline algorithm (with relaxation) based on the collected statistics.

4: Update: $t := t + 1$; update the sample set and for each sampled unit $j$ update $n_j(t) := n_j(t) + 1$.

Figure 1: A unifying framework


The exploitation phase is intended for the algorithm to compute and execute the optimal offline strategy using statistics collected during the exploration phase. However, due to the estimation error, the executed version has to be made error tolerant, e.g., by relaxing the conditions for the steps involving strict equalities. We show how this relaxation is done for the problem in Model II below.


Online_MU: Algorithm details

1: Initialization: Initialize $L$, $z$, $l = 1$ and let each user access the channel once. Denote the collected sample for user $j$ at stage $l$ as $\mathcal{F}_l^j$. Update the number of collected samples $n_j(0) = 1$ and $\mathcal{F}_0$.

2: Exploration: At stage $l$, let $E(l) := \{j : n_j(l) < D_1(l)\}$. At any decision epoch, if $E(l) \neq \emptyset$, let user $j \in E(l)$ transmit right away. If multiple such $j$ exist, a user is selected randomly from $E(l)$.

3: Exploitation: Otherwise, if $E(l) = \emptyset$, calculate the optimal threshold $x^*$ according to Eqn. (6) using the collected samples of each user $j$ and follow the scheduling strategy detailed in Offline_MU.

4: Update: $l := l + 1$; for the user $j$ who accessed the channel update $n_j(l) := n_j(l) + 1$ and its sample set.

Figure 2: Online_MU
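The exploration test shared by the online algorithms amounts to comparing per-unit sample counts against a slowly growing threshold. The sketch below assumes the threshold has the form D_1(t) = L * t^z * log t with natural logarithm; the function names and the dictionary of counts are hypothetical.

```python
import math


def exploration_threshold(t, L, z):
    """D_1(t) = L * t**z * log(t), the assumed minimum sample count at stage t."""
    return L * (t ** z) * math.log(t)


def under_sampled_units(sample_counts, t, L, z):
    """Return the set E(t) of units sampled fewer than D_1(t) times.

    sample_counts: dict mapping unit id -> number of samples so far.
    An empty result means the algorithm is in an exploitation stage.
    """
    threshold = exploration_threshold(t, L, z)
    return {j for j, n in sample_counts.items() if n < threshold}
```

With L = 1 and z = 0.5, the threshold at t = 100 is about 46, so a unit sampled 5 times would be explored while a unit sampled 50 times would not.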

We now detail the online counterparts for Offline_MU and Offline_MC by filling in the details into the above general framework. As a notational convention, we will denote by $\hat{y}$ the estimated version of $y$, by $t_j(k)$ the meta stage when the $k$-th sample is collected for channel $j$, and by $\hat{E}[X]$ the sample mean of $X$.

In Online_MC, besides the clear separation between exploration and exploitation phases, several relaxations are invoked, and the relaxation term can be viewed as the tolerance/confidence region. This tolerance region decreases in time $t$ and approaches 0 asymptotically as the estimation errors decrease as well. There is an inherent trade-off between exploration and the tolerance region: with more exploration steps (a larger $z$), a finer tolerance region can be achieved. We shall further discuss the role of $z$ in the analysis.
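As an illustration of how such a relaxation turns a strict equality into a tolerant test, the sketch below selects all units whose estimated threshold lies within a tolerance of the largest estimate. The function name is hypothetical and the tolerance is left as a free parameter `tol`; how it should shrink with $t$ is not assumed here.

```python
def relaxed_top_set(a_hat, tol):
    """Return the units whose estimate is within tol of the maximum.

    a_hat: dict mapping unit id -> estimated threshold.
    With tol = 0 this reduces to the exact arg-max set.
    """
    best = max(a_hat.values())
    return {j for j, a in a_hat.items() if abs(a - best) <= tol}


# Usage: estimates 0.80 and 0.79 are treated as tied under tol = 0.02.
tied = relaxed_top_set({1: 0.80, 2: 0.79, 3: 0.50}, tol=0.02)
```

Treating near-ties as ties avoids committing to an ordering that is driven purely by estimation noise.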

5. REGRET ANALYSIS 

In this section we analyze the performance of the online algorithms. We present the main results for both Online_MU and Online_MC. Since Online_MC is a much more complex algorithm and its analysis can be easily adapted for Online_MU, for brevity we will only provide details for Online_MC.

Online_MC: Algorithm details

1: Initialization: Initialize $L$, $z$, $t = 1$ and sense each channel once. Update the number of times each channel has been sensed and observed as $n_j(t) = 1$. Update the collection of samples for each channel $j$.

2: Exploration: At meta stage $t$, if $E(t) := \{j : n_j(t) < D_1(t)\} \neq \emptyset$, sense the set $E(t)$ of channels sequentially and choose the one with the best instantaneous condition.

3: Exploitation: If $E(t) = \emptyset$, calculate the optimal strategy according to the steps in Offline_MC (with relaxation) based on the collected statistics $\{\mathcal{F}_k\}_{k=0}^{t-1}$ as follows:

Online.STEP 1: Calculate $\hat{a}_j, \hat{b}_j$ as follows, where $\epsilon_t$ denotes the relaxation term at meta stage $t$:

$$\hat{a}_j = \min\{u : u \geq \hat{E}[X_j],\; c_j + \epsilon_t \geq \hat{E}[\max(X_j - u, 0)]\} ,$$
$$\hat{b}_j = \max\{u : u \leq \hat{E}[X_j],\; c_j + \epsilon_t \geq \hat{E}[\max(u - X_j, 0)]\} .$$

Online.STEP 2: Follow STEP 2 of Offline_MC but with the candidate set relaxed to

$$\{j \in S : |\hat{a}_j - \max_{i \in S} \hat{a}_i| \leq \epsilon_t\} .$$

Online.STEP 3: Follow STEP 3 of Offline_MC but with the following relaxations:

$\hat{b}_1 \geq \hat{a}_2 - \epsilon_t$: (3.3.1); $\hat{b}_2 \geq \hat{b}_1 - \epsilon_t$: (3.3.2);

$\hat{g}_1(0) \geq \max\{\hat{E}[X_1], \hat{g}_2(0)\} - \epsilon_t$: (3.3.2).

4: Update: $t := t + 1$; for each sensed channel $j$ update $n_j(t) := n_j(t) + 1$ and the sample set $\mathcal{F}_t$.

Figure 3: Online_MC


Before formalizing the regret analysis for Online_MC, we outline the key steps. The regret consists of two parts: that incurred during exploration phases and that during exploitation phases. For exploration regret, we will try to bound the number of exploration steps that are needed. For the exploitation phase, the regret is determined by how accurately decisions are made using estimated values. Specifically, Online.STEP 1 does not have a decision making step as it is simply a calculation, though we will show later in the proof that the calculation of $\{\hat{a}_j, \hat{b}_j\}$ does play an important role in the sorting and decision making process. In Online.STEP 2, if the sorting is done incorrectly then this could lead to errors in Online.STEP 3, as all decision making and sensing orders are based upon the ordering of the channels. Online.STEP 3 has the following errors: (1) error in the calculation of the $a_j$'s and $b_j$'s, (2) error in calculating $d_s$, and (3) error in calculating a set of value functions for sub-step 3.3.

5.1 Assumptions 

We state a few mild technical assumptions. We will assume non-trivial channels, i.e., $E[X_j] > 0, \forall j \in \Omega$, so that they all have positive average rates. We will also assume all channel realizations are bounded, i.e., finite support over all channel conditions, $0 < \sup_{j \in \Omega, \omega} X_j(t; \omega) < \infty, \forall t$, with $\omega$ being an arbitrary channel realization. This is not a restrictive assumption since in reality the transmission rate is almost always non-trivial and bounded.
Moreover denote

$$\Delta^* = \max |X_j(t; \omega) - X_i(t; \omega')| + \sum_i c_i .$$

$\Delta^*$ can be viewed as an upper bound for a one-step loss when a sub-optimal decision is made, and $\Delta^* < +\infty$ (note the $c_i$'s are finite).
Finally, we assume the cdf of each channel $i$'s condition satisfies the Lipschitz condition, i.e., there exist $\mathcal{L}$ (note different from $L$) and $a > 0$ such that

$$|F_{X_i}(x + \delta) - F_{X_i}(x)| \leq \mathcal{L} \cdot |\delta|^a, \quad \forall i, x, \delta .$$

The Lipschitz condition has been observed to hold for various distributions, for example the exponential distribution and the uniform distribution [25].

5.2 Main results for Online_MC 

We first separate the regret for different phases. We have the following simple upper bound on the regret $R_{II}(T)$,

$$R_{II}(T) = R_e(T) + R_s(T) \leq R_e(T) + (R_2(T) + R_3(T)) .$$

The first term $R_e(T)$ is the regret from exploration phases. $R_s(T)$ is the regret from exploitation, which can be further upper bounded by the two terms from Online.STEP 2 and Online.STEP 3 of Online_MC, respectively: $R_2(T)$ comes from the sorting procedure and $R_3(T)$ comes from the last step of decision making. Notice that for Online.STEP 1 there is no direct regret incurred by parameter calculation: the errors in the calculation are reflected in Online.STEP 2 and Online.STEP 3 later. The idea of upper bounding the regret by a union bound will be repeatedly utilized in the following analysis. For example, we can show that the regret in each step above can again be upper bounded by the sum of regrets of each of its sub-steps. Therefore we will not restate the details of the bounding for the rest of the proof. Denote the sum $S_p(T) := \sum_{t=1}^{T} 1/t^p$. Our main result for the regret analysis is summarized in Theorem 1.


5.4.1 Exploitation regret for Online.STEP 2 

We bound the regret associated with the sorting process of On¬ 
line.STEP 2. Details can be found in the Appendix. 

Lemma 3. Regret R 2 {T) is bounded as follows, 

R2(T)<A*-2-NY^^. 

f=l ‘ 

The main challenge in this proof is to relate the sampling uncer¬ 
tain to the ones in our decision making process. First of all we could 
show the calculation of E [Xj], aj , bj can fall into certain confidence 
region when the number of exploration steps are large enough (L). 
Moreover the estimation errors of , bj are proportional to the one 
for E[Xj\. Intuitively this is due to the calculation of Oj.bj which 
relates to the calculation of E\Xj\ in a piece-wise linear way. Next 
consider calculating j*. There are potentially two types of errors. 
First is the decision error associated with the decision process of 
telling whether the following holds 1- _r . To bound the error of 
making the wrong call, we are going to show when aj = bj, we 
could bound the probability of aj f bj. Alongside the binary deci¬ 
sion making, we also have the estimation error for Online.STEP 2 
for terms such as E [Ay] ,E\Xj\Xj > fly], pixfa ) ' 

5.4.2 Exploitation regret for Online.STEP 3 

We bound the regret associated with the decision making step 
{Online.STEP 3). Details can be found in the appendix. 


Theorem 1. There exists a constant L such that the regret for 
Online_MC is bounded by 


Lemma 4. Regret R^iT) is bounded as follows, 


RiiT) <A*-(ci ■s^.,j2{T) +C*2-S2{T)) , 


R,i{T)<A*iNLTHogT-yCi •i„.,/ 2 (r)+C 2 \ , 


where Ci, C| are positive constants. 


time uniformly, where Ci,C 2 >0 are constants. 


Here L is larger than a certain positive constant which we de¬ 
tail later. It is easy to notice since T^logT and 80 th 

sub-linear terms (^ci.t/z is on the order of 1 — while S 2 {-) is 
bounded by a constant since Sp{T) < oo,Vp > 1,TRjj{T) is also 
sub-linear and asymptotically we achieve zero-regret on average 
{iimT^ooRii(T)/T = 0). The first term T^logT is due to the ex¬ 
ploration while the term i'oi.j /2 comes from exploitation. Clearly 
we see with a larger z (more exploration invoked), we will have 
a larger regret term from exploration phases; however the regret 
for exploitation will decrease. The balanced setting is achieved at 


z = 1 - 


T- 




2 

2+a 


5.3 Bounding exploration regret 

We start with bounding the exploration regret Re { T ). 

Theorem 2. The exploration regret Re ( T ) is bounded as 

Re { T )< DfT )- NA * . (7) 

Proof. Notice since the exploration phase requires Dj (T) sam¬ 
plings for each channel up to time T , we know there are at most 
N -Di ( T ) exploration phases being triggered. For each exploration 
phase, the regret is bounded by A*, completing the proof. □ 


5.4 Bounding exploitation regret 

We next consider regret incurred during exploitation phases. 


The proof is obtained by bounding the decision errors in each of 
the sub-steps Online.STEP 3.1, 3.2, 3.3. The technical challenges 
again come from bounding the errors with calculating various pa¬ 
rameters in the decision making steps, including for instance ds and 
the value function V(, ■, )s. 

Combine Re{T),R 2 (T),R 2 {T) we have our main results. 
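To get a feel for how the two dominant terms of the bound in Theorem 1 trade off, the following sketch (an illustration only, not part of the analysis) evaluates $S_p(T) = \sum_{t=1}^{T} t^{-p}$ numerically and compares the exploration term $T^z \log T$ with the exploitation term $S_{\alpha z/2}(T)$ for several values of $z$. All multiplicative constants are set to 1 and $\alpha = 1$ is assumed purely for illustration; the helper names are hypothetical.

```python
import math

def S(p, T):
    """Partial sum S_p(T) = sum_{t=1}^{T} 1 / t**p."""
    return sum(1.0 / t ** p for t in range(1, T + 1))

def bound_terms(z, alpha, T):
    """Exploration term T^z log T and exploitation term S_{alpha z / 2}(T),
    with all multiplicative constants set to 1 (an assumption)."""
    return T ** z * math.log(T), S(alpha * z / 2.0, T)

def balanced_z(alpha):
    """Exponent at which T^z and T^(1 - alpha z / 2) grow at the same rate."""
    return 2.0 / (2.0 + alpha)

T, alpha = 4000, 1.0
for z in (0.2, 0.5, balanced_z(alpha), 0.9):
    explore, exploit = bound_terms(z, alpha, T)
    print(f"z={z:.3f}  explore/T={explore / T:.4f}  exploit/T={exploit / T:.4f}")
```

Increasing `z` raises the exploration term and lowers the exploitation term, which is the trade-off behind the balanced choice of $z$. By contrast, `S(2, T)` stays below $\pi^2/6$ for every `T`, which is why terms multiplying $S_2(T)$ only contribute a constant.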


5.5 Discussion on parameter L

In most of our results we assumed $L$ to be sufficiently
large. We summarize the actual conditions on $L$ below (please refer
to the appendix for details):


(Condition 1) : L > max{4, 2 ^ 

2maxycyy 

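(Condition 2) : $L \ge 1/\epsilon_0^2$ ,

where $\{c_{1,j}\}_{j \in \mathcal{O}}$ is a set of positive constants and $\epsilon_0$ is a solution
in $\epsilon$ of $C \cdot (\epsilon + \mathcal{L} \cdot (c_{1,j}+1)^\alpha \epsilon^\alpha) \le \frac{\min\{\epsilon_3, \epsilon_4\}}{2}$, where $C$ is a positive
constant, $\epsilon_3 = \min_{i \ne j} |E[X_i] - E[X_j]|$ and
$\epsilon_4 = \min_{i \ne j} \left| \frac{E[X_i]-c_i}{P(X_i > a_i)} - \frac{E[X_j]-c_j}{P(X_j > a_j)} \right|$
(we assume $\epsilon_3, \epsilon_4 > 0$).

From (Condition 1) we know that when the $\{a_j\}$s are closer to each
other, $L$ should be chosen to be larger. Also from (Condition 2)
we know that when the channels' expected rewards $E[X_j]$ and the terms
$\frac{E[X_j]-c_j}{P(X_j > a_j)}$ (which can
be viewed as the potential reward when sensed) are closer to each other,
again $L$ should be chosen to be larger. The intuition here is that
in such cases a larger $L$ helps achieve higher accuracy in the
estimations, so as to differentiate two channels that are similar.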








The selection of $L$ depends on a set of $\epsilon$s, which further depend
on statistical information of the $X_j$s (though in a weaker sense, as we only
need to know a lower bound on them), which is assumed to be unknown.
However, following a common technique [26], this assumption can
be relaxed, at the cost of potentially larger regret. In particular
one can show that at any time t with L being a positive constant 
the estimation error Ef for any terms (e.g., a,b or satisfy the 

following, P(Zt >je)<^ , withe, V > 1. Therefore with the error 
region Ef being small enough, there would be no error associated 
with differentiating the channels of the algorithm. Thus there ex¬ 
ists a constant Tg such that, Ef < minE,Vt > Tq- Consider the the 
case Er < ^. Since when the error happens under this case, two 
estimated terms (the sub-optimal and optimal one) are separated by 
at most 2E/. The probability of the corresponding term falls into 
this region is bounded as | (x -f Ej) — Tx,. (jc — E/) | < X • 2“ ■ ^ by 
the Lipschitz condition. Therefore we have the extra error bounded 
by -C • 2“ • ^, which is a constant growing sub-linearly up to 
time Tq. 

5.6 Main results for Online_MU 

For Online_MU we can similarly prove the following result.

Theorem 5. There exists a constant $L$ such that the regret for
Online_MU is bounded by

Ri{H)< A*log// + Cl • s^.,i 2 {H) +C 2 -S 2 (H )| , 

time uniformly, where $C_1, C_2 > 0$ are constants.

Notice that although $R_I(H)$ looks similar to $R_{II}(T)$, the parameters
of each term may be very different, i.e., the constants $C_1, C_2$ here
may be quite different from those in Theorem 1, and the constraints on $L$
differ as well, due to the different statistical structures of the two
problems. Again the first term comes from the exploration phases, the
second term is due to inaccurate calculation of $x^*$, and the last term
bounds the event that the estimate $\hat{x}^*$ is too different from $x^*$.

6. SIMULATION 

In this section we show a few examples of the performance of the
proposed online algorithm via simulation. We measure the average
regret rate $R_I(t)/t$ ($R_{II}(t)/t$) and compare our performance to the
optimal offline algorithm, a static best single-channel policy, as well
as a weak-regret algorithm.

For simplicity of demonstration we assume channel qualities fol¬ 
low exponential distribution but with different parameters 0. The 
corresponding distributions' parameters are generated uniformly
at random in $[0, 0.5]$. Users' attempt rates $p_i$ are uniformly
generated in the interval $[0, 0.5]$ (in Model I). The costs for
sensing the channels (in Model II) are also randomly generated
according to a uniform distribution on $[0, 0.1]$. In the following
simulation for Model I we have $M = 5$ users, while for Model II we
have $N = 5$ channels. The simulation horizon is set to $H = T = 4{,}000$.
In the set of results for performance comparison with offline
solutions, we set the exploration parameters as $L = 10$, $z = 1/5$. Later
on we show the performance comparison w.r.t. different selections
of $L$ and $z$.
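The setup above can be reproduced along the following lines. This is a minimal sketch, not the authors' simulation code: it assumes the exponential parameters are rates (the text does not say whether they are rates or means), uses Python's standard `random` module with arbitrary seeds, and the function names `make_setup` and `draw_channels` are hypothetical.

```python
import random

def make_setup(n_channels=5, n_users=5, seed=1):
    """Draw one random simulation configuration as described in the text."""
    rng = random.Random(seed)
    # Exponential parameters per channel, uniform on [0, 0.5];
    # the tiny positive floor avoids a degenerate zero rate (an added safeguard).
    rates = [max(rng.uniform(0.0, 0.5), 1e-6) for _ in range(n_channels)]
    # Attempt rates of the users in Model I, uniform on [0, 0.5].
    attempt = [rng.uniform(0.0, 0.5) for _ in range(n_users)]
    # Sensing costs in Model II, uniform on [0, 0.1].
    costs = [rng.uniform(0.0, 0.1) for _ in range(n_channels)]
    return rates, attempt, costs

def draw_channels(rates, rng):
    """One IID realization of all channel qualities for a single time step."""
    return [rng.expovariate(r) for r in rates]

rates, attempt, costs = make_setup()
horizon = 4000                     # H = T = 4,000
L, z = 10, 1 / 5                   # exploration parameters used in the comparison
rng = random.Random(2)
trace = [draw_channels(rates, rng) for _ in range(horizon)]
```

Fixing the seeds makes a run repeatable, so that the online algorithm, the offline oracle and the baselines can all be evaluated on the same channel trace.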

6.1 Comparison with Offline Solution 

We first take the difference between the oracle (Offline_MU)
and Online_MU at each step $t$ and divide it by $t$ (i.e., we plot
$R_I(t)/t$). This regret rate is plotted in Figure 4, and we clearly see
convergence consistent with sub-linear regret. We repeat the experiment
for Online_MC; the regret convergence is shown in Figure 5, which
validates our analytical results. To make the comparison more
convincing, we compare the accumulated reward of Online_MC,
Offline_MC and the best single-channel (action) policy, which
always selects the best channel in terms of its average rate (channel
statistics are assumed to be known a priori), in Figure 6. In
particular, we see that the accumulated reward of Online_MC (red square)
is close to the performance of the oracle (blue circle), which has
all channel statistical information and follows the optimal decision
process previously described in Offline_MC. We observe that the
dynamic policies clearly outperform the best single-channel policy.

Footnote: We have similar observations for other distributions. The details
are omitted for brevity.



Figure 4: Convergence of average regret: Online_MU



Figure 5: Convergence of average regret: Online_MC



Figure 6: Online_MC vs. Offline_MC vs. Best single


















6.2 Comparison with naive reinforcement learning solution

As mentioned in the introduction, there exist online
solutions for a user to find the best channel in terms of its average
condition (minimizing weak regret). We demonstrate the advantages
of our proposed online algorithm by comparing
Online_MC with UCB1, a classical weak-regret online learning
algorithm [?] designed for IID bandits. The result in Figure
7 clearly shows the performance gain of Online_MC.
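For reference, the weak-regret baseline can be sketched as follows. This is the textbook UCB1 index rule (sample mean plus $\sqrt{2 \ln t / n_i}$) for rewards in $[0, 1]$, not the exact simulation code behind Figure 7; the Bernoulli arms, their means, the horizon and the function name `ucb1` are illustrative assumptions.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run textbook UCB1 on Bernoulli arms; return pull counts and total reward."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms      # number of pulls per arm
    sums = [0.0] * n_arms      # accumulated reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # initialization: pull every arm once
        else:
            # Pick the arm with the largest upper confidence index.
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total

counts, total = ucb1([0.2, 0.5, 0.8], horizon=4000)
```

Because UCB1 concentrates its pulls on the single arm with the highest mean, its reward is at best that of the best static single-channel policy, which is the benchmark behind weak regret.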



Figure 7: Online_MC vs. UCB1


6.3 Effects of parameter selection 

We next take a closer look at the effects of parameter selection,
primarily of $L$ and $z$. We demonstrate with Online_MC. We
repeat the above sets of experiments with different $L$, $z$ combinations and
tabulate the average reward per time step. From Table 1 we observe that
the effect of $L$ is not monotonic: a smaller $L$ incurs fewer
exploration steps, but more errors occur at exploitation steps
due to lower confidence in calculating the optimal strategy. On
the other hand, a large $L$ inevitably imposes a higher sampling
burden and thus becomes less and less favorable as it increases.
Similar observations hold for $z$ (Table 2), since $z$ controls the length
of exploration phases jointly with $L$ but on a different scale. However, it is
interesting to observe that when $z$ grows large enough (e.g.,
$z = 1/2$), the performance drops drastically; this is because in
such a case more than enough effort is spent on sensing
steps.

| L (z = 1/5)    | 5      | 10     | 20     | 30     | 40     |
| Average reward | 0.3391 | 0.3522 | 0.3353 | 0.3183 | 0.3166 |

Table 1: Different L (average reward = 0.27 with random channel selection)

| z (L = 10)     | 1/6    | 1/5    | 1/4    | 1/3    | 1/2    |
| Average reward | 0.3411 | 0.3522 | 0.3557 | 0.3017 | 0.1949 |

Table 2: Different z (average reward = 0.27 with random channel selection)


7. DISCUSSION 

In this section we discuss several possible extensions of the
current set of results, primarily concerning the statistical assumptions
on channel evolution. Throughout the paper, we assume the channel
statistics evolve over time as an IID process, though with unknown
distributions and parameters. An immediate extension of
this work is to study the online learning algorithm when such
evolution is Markovian. For Markovian channels we again need to
consider two categories of problems, namely rested and restless
bandits [27]. For rested bandits, the offline optimal solution (when the
transition parameters are known) is famously given by the Gittins
index. Following exploration and exploitation procedures similar to those
detailed in the current paper, we can obtain an accurate enough
estimate of all transition parameters of the bandits and thus
approximate the optimal indices.

The main difficulty in the restless case is that
even the offline strategy is not easy to obtain,
that is, we do not have a clear target to track. Under certain settings,
the myopic policy has been shown to be optimal in one of our
works [28], and following the procedures of RCA proposed in [?] for
learning with restless bandits we could again achieve a fairly accurate
estimation and approach myopic sensing in an online fashion.
However, the optimal solution for general stopping rule/sequential
decision making problems with restless bandits is not yet clear,
which is also one focus of our ongoing work.

Another interesting extension we would like to pursue is
learning with (multiuser) interference. A natural way of doing this
is to combine stochastic bandit learning (for channel availability)
with adversarial learning (for user interference). We conjecture that
similar results could be obtained, while we emphasize that in such a case
two types of exploration would be needed: the first is exploration
of other users' availability, as commonly done in adversarial
settings, and the other is exploration of the channels' statistics. However,
the technical validation would not be trivial to detail, since
considering multiuser effects in a sequential decision making
process is known to be hard even in an offline setting [29], primarily
due to collisions and interference.

The third aspect concerns the assumption that within
the time horizon of our problem the statistical properties of the
channels stay unchanged. Although we made this assumption
(in order to derive bounds), the exploration nature of the learning
algorithm is in principle designed to detect and adapt to changes
in the statistics. We are currently looking into using
additional randomization techniques to enhance this adaptivity.
Notably, a recent paper proved a sharp (sub-linear) bound for
certain cases where such non-stationary statistical properties satisfy
a bounded variation condition [?]. Our setting
is naturally more challenging since we need to track not only the
change of each bandit's mean reward, but also many other statistical
parameters associated with the decision making processes.

8. CONCLUSION 

In this paper we studied online channel sensing and transmission
scheduling in wireless networks when channel statistics are
unknown a priori. Without such information, we proposed an
online learning algorithm that collects samples of channel
realizations while making scheduling decisions that track the optimal ones.
We showed that the proposed learning algorithm (for both the multiuser and the
multichannel model) achieves sub-linear regret uniform in time, which
further gives a zero-regret algorithm on average. Our claims are
validated via both analytical and simulation results.
APPENDICES

Notations

We summarize the main notations in Table 3.

| Notation                         | Physical meaning                          |
| $M$ / $\mathcal{M}$              | number/set of users                       |
| $N$ / $\mathcal{O}$              | number/set of channels                    |
| $S$                              | subset of channels                        |
| $X_i$ ($X_i(t)$)                 | channel $i$'s reward (at time $t$)        |
| $c_i$                            | cost for sensing channel $i$              |
| $p_i$                            | access attempt rate of user $i$           |
| $V(x,S)$                         | value function with state $(x,S)$         |
| $f_{X_i}$, $F_{X_i}$             | p.d.f./c.d.f. of channel $i$              |
| $\pi$, $\sigma$                  | access & sensing policies                 |
| $t$, $n$                         | system time, decision step for each $t$   |
| $R_I(H)$ ($R_{II}(T)$)           | accumulated regret up to stage $H$ ($T$)  |
| $(x(n), S_n)$                    | information state at the $n$-th epoch     |
| $L$, $z$                         | exploration parameters                    |
| $\mathcal{L}$, $\alpha$          | Lipschitz parameters                      |

Table 3: Main Notations

Outline of the proofs and main results

Due to space limitations we first sketch the main steps and results
towards establishing the theorems.


Proof of regret for $R_2(T)$

Lemma 6 . With sujfciently large L{>-^),yj we have, P{\E[Xj] — 
E[Xj]\ > e) < ^ , and P{\aj — aj\ > cij ■ e) < ^, P{\bj - bj\ > 
C2J ■£) 5 ; ^ I where z,c\ j,C 2 .j are positive constants. 

Based on above results we could show 

Lemma 7. At time t with sufficiently large L, and any iteration 
steps of the sorting procedure of Online.STEP 2 we have 

Consider calculating j* we have the following results. 

Lemma 8 . At time t with sufficiently large L, the error for sort¬ 
ing set S is bounded as, 

P{SfS)<N-^. 


Putting all terms together and multiplying by $A^*$, we obtain the result claimed
in Lemma 3.

Proof of regret for $R_3(T)$

We sketch the key steps towards getting the claim. 


Online.STEP 3.1. 

At first step of deciding whether x(n) > ai of Online.STEP 3, 
there will be no error when x < min{fli,ai} or x > max{fli,ai}. 
Consider jc falling in the middle. Make e being small enough, e = 
. As we already proved P{\d\ —ai| > cij -e) < ^. Also due 
to the relaxation of ttf, the difference between di and the true ai is 
bounded away by at most ci j • e + e. For \a\ — ai l<(ci,i + l)-e. 
the probability that x falls within the middle is bounded as 


$$P\big(\exists i \text{ s.t. } X_i(t) \in [\min\{\hat{a}_1, a_1\}, \max\{\hat{a}_1, a_1\}]\big)
\le N \cdot |F_{X_i}(\hat{a}_1) - F_{X_i}(a_1)| \le \frac{N \mathcal{L} \cdot (c_{1,1}+1)^\alpha}{t^{\alpha z/2}}$$


Online.STEP 3.2. 

We first prove the following results. 

Lemma 9. With sufficiently large L and information state (x, S), 
we have at time r Ve > 0 

P(|V(A,5)-V(x,S)|>|5|.e)<^. 

Based on above results we prove that the estimation of ds can be 
bounded by a confidence region, which we detail as follows. 


Lemma 10. With sufficiently large $L$ and channel set $S$,

$$P\Big( |\hat{d}_S - d_S| > \frac{2|S|+3}{C_{d_S}} \cdot \epsilon \Big) \le \frac{4}{t^2}$$

at time step $t$, $\forall \epsilon > 0$, where $C_{d_S} = P(X_1 < d_S)$.


(Sketch) The proof is primarily done by analyzing the estimation
errors on both sides of the equation

$$V(0, S_n) = -c_1 + E[V(\max\{d_S, X_1\}, S_n - \{1\})] ,$$

which determines $d_S$. For bounding the value functions we repeatedly
use Lemma 9. Taking $L \ge 4$ and $\epsilon = 1/t^{z/2}$ will lead to our bounds.


Remark 11. The above result invokes a constant $C_{d_S} = P(X_1 < d_S)$.
If $P(X_1 < d_S) = 0$, i.e., $X_1(\omega) > d_S, \forall \omega$, our bound is not well
defined. In fact in this case, what really matters is the overlap
between $[0, \hat{d}_S]$ and $[\underline{X}_1, \overline{X}_1]$ (the support of $X_1$). So long as the
overlap is small enough, the decision error is again
bounded.


Online.STEP 3.3. 

When x(n) < ds, the optimal decision comes from one of three 
cases. For the first two cases, we have the following lemmas char¬ 
acterizing the regrets : for sub-steps Online.STEP 3.3.1, 3.3.2 there 
are possibly three decisions to make and we have their error bounded 
as follows (detailed proofs omitted) 

Lemma 12. With sufficiently large L, (1). if b\ > a 2 , P{bi < 
02 - ^ . (2). Ifb2 > bi, P{b2 < &i - 7^) < ^ . (3). If 

gi ( 0 ) > max{£'[Ai],g2(0)}, P(gi (0) < max{£[li],g2(0)}- tIt) < 

2 

7- ■ 

(Sketch) For error in bi in Online.STEP 3.3.2, the analysis is the 
same as for a\ as in Online.STEP 3.1 since we already established 
its estimation error bounds. 

For the last case in Online.STEP 3.3.3, first notice that if $E[X_1] = g_2(0)$,
there is no error associated with the last step, since guessing
(accessing without sensing) the first channel and probing the second
essentially return the same expected reward. Therefore we show the
error analysis when $E[X_1] \ne g_2(0)$. We then bound the error of
estimating bo (this is similar with proving the bound for ds and 
we omit the details for proof) : with Ct,„ being certain constant, 
P{\bo — fool > < p ■ Moreover we have the following results: 

(Details for proof omitted as it is quite similar to previous ones.) At 
time t P(sign(£[li] -g 2 ( 0 )) f sign(£[Ai] -g 2 ( 0 ))) < These 
cover all parameters needed for the decision making queries. 
Putting all terms together, we obtain the result claimed in Lemma 4.


by Lipschitz condition. Add up for all t we have a sub-linear term. 






















Proof for Lemma 6


Proof. First of all, by the law of large numbers, with enough
samples we can bound the difference $|\hat{E}[X] - E[X]|$ by a positive
constant $\epsilon$. Specifically, by the Chernoff-Hoeffding bound we have

P{\E[X]-E[X\\>i) , 

so long as L • ^ we have the results. 
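For intuition about how many exploration samples a concentration bound of this type requires, the sketch below uses the standard two-sided Hoeffding inequality for $n$ i.i.d. samples bounded in $[0, B]$, namely $P(|\hat{E}[X] - E[X]| > \epsilon) \le 2\exp(-2n\epsilon^2/B^2)$. The function names and the numerical values are illustrative assumptions and are not taken from the analysis above.

```python
import math

def hoeffding_bound(n, eps, B=1.0):
    """Two-sided Hoeffding bound on P(|sample mean - mean| > eps)
    for n i.i.d. samples taking values in [0, B]."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / B ** 2)

def samples_needed(eps, delta, B=1.0):
    """Smallest n for which the Hoeffding bound is at most delta."""
    return math.ceil(B ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

eps, delta = 0.05, 0.01          # hypothetical accuracy / failure probability
n = samples_needed(eps, delta)   # number of samples per channel
print(n, hoeffding_bound(n, eps))
```

The required sample count grows like $1/\epsilon^2$ but only logarithmically in $1/\delta$, which is why an exploration budget with a $\log t$ factor is enough to drive the failure probability down polynomially in $t$.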

The rest of the proof can be done by proving contradictions. First 
let us assume tij > aj -f e. Since {Xj — /j)^ and (/j — Xj)^ are also 
i.i.d. for any constant /r, we know 

P{\E[{XJ-^,)+]-E[iXJ-^,)+]\>e)<^, 

P{\E[i^^-XJ)+]-E[{^,-XJ)+]\>e)<^. 

Now consider the case with E[{Xj —/j)’*"] — E[{Xj — < E. 

Then we have 

E[{Xj~aj)+]<E[{Xj~aj-cijZ)+] 

< E[{Xj —aj)'^] — (1 —P{aj < Xj < aj -\-cije))cije 

^ E^i^Xj — — (1 — Eifi j ^ "El £ 

<E[{Xj~aj)+]<cj. 

So as long as we make sure, 

{l~P{aj <Xj <aj + C\jE))cij > 1 , 
i.e., when ci ,■ > — 777 —, , we have the above holds 

’ — \-P(aj<Xj<aj+cijE) ’ 

which contradicting the optimality of cij. 

Consider the case when aj < aj — ci jE, similarly we could prove 
that with an appropriately chosen ci j we have 

dj > if [Xy] t^l.yE , 

i.e., dj-\-C\ jE > E\Xj\. And moreover 

E[{Xj - (5,- + Cl,;£))+] < E[{Xj-aj)+] < cj , 

which contradicts the optimality of aj. The proof for bj is similar 
with aj and we omit the details for a concise presentation. □ 

Proof of Lemma 7

Proof. First we have as long as 

ntinfl;. ~ I “ 

maxci / • e <-!-=- - — 

j 2 

there will be no error with sorting as. To see this if aj > we have 
dj ~dji > fly — aj. — cpj • e — Cl • Eaj — aif> Since 

P{mj]-mj]\ > - 2 max,ci,^-^ 

<2-e ^ , ( 8 ) 

by Chernoff-Hoeffding bound. Therefore if we have roughly (since 
^ is a much smaller term in order ) 

(9) 

2 maXj cpj 


a 0(l/r^) error is guaranteed. For aj = aj. we can similarly bound 
probability that |dj — dj | > as long as L > 4. □ 

Proof of Lemma 8

Proof. We first prove the following results. 

Lemma 13. With sufficiently large L, V j we have, 

2 

P{dj ¥"bj) < -^ , if bj = Uj . 

Proof. Based on the definition of $a_j$, $b_j$, when $a_j = b_j$
the following hold:

$$a_j = E[X_j] = b_j, \quad c_j \ge E[(X_j - a_j)^+], \quad c_j \ge E[(b_j - X_j)^+] .$$

Suppose we have \E\Xj\ — £[Ay]| < e (as proved in previous lemma 
with sufficiently large L) we therefore have 

E[{Xj~Em)+] < E[{Xj-E[Xj])+]+e 

<El{Xj-E[Xj])+]+2E<Cj + X^ 


E[iE[Xj] -!,■)+] < +e 

<£[(£[2(,.]-A^)+]-f2e<c,- + ^, 

as long as e < from which we have bj = dj based on the 
definition of d, is. □ 

Similar with above proof we have the following results : 
Lemma 14. Ear sufficiently large L, at time t we have Ve > 0, 


P{\ 


E[^L 


P{Xj > 




for certain constant C. 


Proof. Consider the term 


E[X]- 

P(X>S 


and we want to bound the 
estimation error associated with above terms, i.e., the probability, 

£[l,-]-c,- £[3(,-]-c,- 

>(i,->d,-) p{Xj>ajy >■ 

We need the following fact:

$$\left| \frac{1}{x+\delta} - \frac{1}{x} \right| \le \frac{1}{x^2} \cdot \delta, \quad \forall x, \delta > 0 . \quad (10)$$
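For completeness, this fact follows from a one-line computation (a worked step added for the reader, using only $x, \delta > 0$):

```latex
\left| \frac{1}{x+\delta} - \frac{1}{x} \right|
  = \frac{\delta}{x\,(x+\delta)}
  \le \frac{\delta}{x^{2}},
\qquad \text{since } x + \delta > x > 0 .
```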

For P{Xj > dj) we have 

|P(l;>d,.)-P(2f,■>«,■) I 

< |P(1,•> d,■)-P(A,■>«-,)I-FE 

< |P(1; > d^-) - > flf) I + e + • (Cl jE)“ . (11) 

The second relation comes from bounds on dj and Lipschitz con¬ 
dition of F(-). Plug in X = P(A > a^) we have 


E[X]-Cj E[X]-Cj 
' P{X > dj) P{X > aj) 


I <C-(e-FM-(ci,jE)“) . 


( 12 ) 


(9) 


for certain constant C. □ 















Denote 


Proof for Lemma 10


£3 = tmn \E[Xj] -Em, z, = tpn1I ■ 

Therefore when 

^ / / ,^ry rvN min{e^,e4| 

C-(e + i:.(cij + l)“e“)<-, 


there is no error with the ordering (as similarly argued in ordering 
flys). Denote a solution for above e as Eq (which is trivial to show to 
exist). Then we further require L-E > ^ , to guarantee a 0(l/r^) 
error. 


Proof. To prove this, first notice that the following dynamic
equation holds for solving $d_S$:

$$V(0, S) = -c_1 + E[V(\max\{d_S, X_1\}, S - \{1\})] .$$

Consider the LHS by the above results we have the probability of 
|V'(0,S) — V(0,S)| > |5|-ebeing bounded by Consider then the 
case with | V(0,5) - V(0,5) | < |5| • £ . 

For the RHS first notice 


Proof for Lemma 9


Proof. We prove the claim by induction on the
size of $S$. When $|S| = 1$ (as well as $|\hat{S}|$; also notice that $\hat{S} = S$ due to
the sorting algorithm we adopted; however, due to the calculation
of $a_j$, $b_j$ and the inaccurate measurement of the $X_j$s, there is still a
discrepancy between the two value functions), we have (suppose $S = \{j\}$)

$$\hat{V}(x,S) = \max\{-c_j + \hat{E}[\hat{V}(\max\{x, X_j\}, \emptyset)],\; x,\; \max_{j \in S} \hat{E}[X_j]\} .$$

Notice that $\hat{V}(\max\{x, X_j\}, \emptyset) = \max\{x, X_j\}$; we then have

$$\hat{V}(x,S) = \max\{-c_j + \hat{E}[\max\{x, X_j\}],\; x,\; \max_{j \in S} \hat{E}[X_j]\} .$$

Since with probability at least 1 — ^ we have, 

$$|\hat{E}[\max\{x, X_j\}] - E[\max\{x, X_j\}]| < \epsilon, \quad |\hat{E}[X_j] - E[X_j]| < \epsilon ,$$

we know w.h.p.

$$|\hat{V}(x,S) - V(x,S)| < \epsilon , \quad (13)$$

since each term in the max function is within the $\epsilon$-confidence
region. We have therefore established the induction basis. Now suppose
the claim is true for $|S| = k$, $k < N$. Consider the case with $|S| = k + 1$.
Based on the dynamic programming equations we know

$$\hat{V}(x,S) = \max\{\max_{j \in S}\{-c_j + \hat{E}[\hat{V}(\max(x, X_j), S - j)]\},\; x,\; \max_{j \in S} \hat{E}[X_j]\} .$$

By induction hypothesis we know with probability at least 1 — 

$$|\hat{V}(\max(x, X_j), S - j) - V(\max(x, X_j), S - j)| < k \cdot \epsilon ,$$

and the fact that $|\hat{E}[X_j] - E[X_j]| < \epsilon$. Therefore

$$\hat{V}(x,S) \le \max\{\max_{j \in S}\{-c_j + \hat{E}[V(\max(x, X_j), S - j)]\},\; x,\; \max_{j \in S} \hat{E}[X_j]\} + k \cdot \epsilon . \quad (14)$$

Also it is easy to notice that with $x$ and $S - j$ being fixed,
$V(\max(x, X_j), S - j)$ is also IID w.r.t. $X_j$. Then we have (via the
Chernoff-Hoeffding bound)
$|\hat{E}[V(\max(x, X_j), S - j)] - E[V(\max(x, X_j), S - j)]| < \epsilon$. Therefore

$$\hat{V}(x,S) \le \max\{\max_{j \in S}\{-c_j + E[V(\max(x, X_j), S - j)]\},\; x,\; \max_{j \in S} E[X_j]\} + (k+1) \cdot \epsilon = V(x,S) + (k+1) \cdot \epsilon .$$

The other side of the inequality can be proved similarly, which
completes the proof. □
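The dynamic programming recursion used in this induction can be made concrete with a small sketch. The code below evaluates an empirical value function from recorded channel samples, replacing each expectation by a sample average and using $V(x, \emptyset) = x$ as the base case, as in the induction basis above. The sample values, the sensing costs and the function name `v_hat` are hypothetical, and the brute-force recursion is only meant for a handful of channels and samples.

```python
from statistics import mean

def v_hat(x, channels, samples, costs):
    """Empirical value function for the state (x, S).

    x        -- best rate observed so far
    channels -- tuple of indices of the channels not yet sensed (the set S)
    samples  -- dict: channel index -> list of observed rates
    costs    -- dict: channel index -> sensing cost c_j
    """
    if not channels:
        return x  # V(x, empty set) = x
    # Option 1: sense some channel j, pay c_j, continue from max(x, X_j).
    sense = max(
        -costs[j]
        + mean(
            v_hat(max(x, s), tuple(k for k in channels if k != j), samples, costs)
            for s in samples[j]
        )
        for j in channels
    )
    # Option 2: stop and use x. Option 3: access the best channel without sensing.
    guess = max(mean(samples[j]) for j in channels)
    return max(sense, x, guess)

samples = {1: [0.2, 0.9, 0.4], 2: [0.6, 0.5, 0.7]}  # hypothetical observations
costs = {1: 0.05, 2: 0.05}
value = v_hat(0.0, (1, 2), samples, costs)
```

Since every expectation is replaced by a sample mean, the gap between this estimate and the true value function shrinks as more samples are collected, which is the effect the induction argument above quantifies.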


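$$|\hat{E}[\hat{V}(\max\{\hat{d}_S, X_1\}, S - \{1\})] - \hat{E}[V(\max\{\hat{d}_S, X_1\}, S - \{1\})]| < |S| \cdot \epsilon .$$

We next show that there exists a $\delta = O(\epsilon)$ such that $|\hat{d}_S - d_S| < \delta$. We
prove this by contradiction. Suppose $\hat{d}_S > d_S + \delta$. We first would
like to show the following:

$$\hat{E}[\hat{V}(\max\{\hat{d}_S, X_1\}, S - \{1\})] - E[V(\max\{d_S, X_1\}, S - \{1\})] > |S| \cdot \epsilon . \quad (15)$$

To see this, first notice

$$\hat{E}[\hat{V}(\max\{\hat{d}_S, X_1\}, S - \{1\})] - E[V(\max\{d_S, X_1\}, S - \{1\})]$$
$$= \big( \hat{E}[\hat{V}(\max\{\hat{d}_S, X_1\}, S - \{1\})] - \hat{E}[V(\max\{\hat{d}_S, X_1\}, S - \{1\})] \big)$$
$$+ \big( \hat{E}[V(\max\{\hat{d}_S, X_1\}, S - \{1\})] - E[V(\max\{d_S, X_1\}, S - \{1\})] \big) .$$

Since

$$\hat{E}[\hat{V}(\max\{\hat{d}_S, X_1\}, S - \{1\})] - \hat{E}[V(\max\{\hat{d}_S, X_1\}, S - \{1\})] \ge -(|S|+1) \cdot \epsilon , \quad (16)$$

it is sufficient to prove that

$$\hat{E}[V(\max\{\hat{d}_S, X_1\}, S - \{1\})] - E[V(\max\{d_S, X_1\}, S - \{1\})] > (2|S|+1) \cdot \epsilon . \quad (17)$$

Notice

$$\hat{E}[V(\max\{\hat{d}_S, X_1\}, S - \{1\})] - E[V(\max\{d_S, X_1\}, S - \{1\})]$$
$$\ge E[V(\max\{\hat{d}_S, X_1\}, S - \{1\})] - E[V(\max\{d_S, X_1\}, S - \{1\})] - 2\epsilon$$
$$\ge C_{d_S} \cdot \delta - 2\epsilon ,$$

where $C_{d_S}$ is a positive constant. Therefore, selecting $\delta$ large enough
such that $C_{d_S} \cdot \delta > (2|S|+3) \cdot \epsilon$, we finish the proof of this case. Similarly we
can prove the case $\hat{d}_S < d_S - \delta$. This completes the proof. □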

Proof for Online.STEP 3.3.3 

On the sign of $E[X_1] - g_2(0)$.

Since the two cases of the sign are symmetric, we only
prove the case when $E[X_1] - g_2(0) > 0$. Since

g 2 ( 0 ) = -C 2 -f£[V'(max(li, 0 ),l)] 

= -C2+£[V(1i,1)]<-C2+£[V(Xi,1)]+£, 

and E[X{\ > £[2(i] — E. Therefore as long as E > we 

proved the claim. 

On $b_0$.

The proof is similar to the one for $d_S$: we bound the estimation
error of the equation leading to the solution for $b_0$, since $b_0$ satisfies
the following equality:

$$g_1(b_0) = \max\{E[X_1], g_2(0)\} . \quad (18)$$






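Consider the LHS $g_1(b_0)$. If $\hat{b}_0 > b_0 + \delta$ we have

$$\hat{g}_1(\hat{b}_0) > g_1(b_0) + C_{b_0} \cdot \delta - \epsilon , \quad (19)$$

for a certain constant $C_{b_0}$. For the RHS we have

$$|\max\{\hat{E}[X_1], \hat{g}_2(0)\} - \max\{E[X_1], g_2(0)\}| < \epsilon . \quad (20)$$

Therefore if $\delta > \frac{2\epsilon}{C_{b_0}}$ we arrive at a contradiction.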


